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D 1 Abstract 

Q 

We investigate a curious relationship between the structure of a discrete graphical 
model and the support of the inverse of a generalized covariance matrix. We show that 
for certain graph structures, the support of the inverse covariance matrix of indicator 
I 1 1 variables on the vertices of a graph reflects the conditional independence structure of the 

graph. Our work extends results that have previously been established only in the context 
of multivariate Gaussian graphical models, thereby addressing an open question about the 
significance of the inverse covariance matrix of a non-Gaussian distribution. The proof 
exploits a combination of ideas from the geometry of exponential families, junction tree 
theory, and convex analysis. These population-level results have various consequences for 
graph selection methods, both known and novel, including a novel method for structure 
estimation for missing or corrupted observations. We provide non-asymptotic guarantees 
for such methods, and illustrate the sharpness of these predictions via simulations. 
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1 Introduction 



Graphical models are used in many application domains, running the gamut from computer 
vision and civil engineering to political science and epidemiology. In many applications, 
" estimating the edge structure of an underlying graphical model is of significant interest. For 

instance, a graphical model may be used to represent friendships between people in a social 
network [3] or links between organisms with the propensity to spread an infectious disease [24] . 
It is a classical corollary of the Hammersley-Clifford theorem [13, 5, 19] that zeros in the 
inverse covariance matrix of a multivariate Gaussian distribution indicate absent edges in 
the corresponding graphical model. This fact, combined with various types of statistical 
estimators suited to high dimensions, has been leveraged by many authors to recover the 
structure of a Gaussian graphical model when the edge set is sparse (e.g., see the papers [8, 
23, 26, 32] and references therein). Recently, Liu et al. [20, 21] introduced the notion of 
a nonparanormal distribution, which generalizes the Gaussian distribution by allowing for 
monotonic univariate transformations, and argued that the same structural properties of the 
inverse covariance matrix carry over to the nonparanormal. 

However, for general non-Gaussian graphical models, the question of whether a relation- 
ship exists between conditional independence and the structure of the inverse covariance 
matrix remains unresolved. In this paper, we establish a number of interesting links between 
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covariance matrices and the edge structure of an underlying graph in the case of discrete- 
valued random variables. (Although we specialize our treatment to multinomial random 
variables due to their widespread applicability, several of our results have straightforward 
generalizations to other types of exponential families.) Instead of only analyzing the stan- 
dard covariance matrix, we show that it is often fruitful to augment the usual covariance 
matrix with higher-order interaction terms. Our main result has a striking corollary in the 
context of tree-structured graphs: for such models, the inverse of a generalized covariance 
matrix is always (block) graph-structured. In particular, for binary variables, the inverse of 
the usual covariance matrix may be used to recover the edge structure of the tree. We also 
establish more general results that apply to arbitrary (non-tree) graphs, specified in terms of 
graph triangulations. This more general correspondence exploits ideas from the geometry of 
exponential families [7, 31], as well as the junction tree framework [18, 19]. 

As we illustrate, these population-level results have a number of corollaries for graph 
selection methods. Graph selection methods for Gaussian data include neighborhood regres- 
sion [23, 34] and the graphical Lasso [12, 26, 29, 11], which corresponds to maximizing an 
.^-regularized version of the Gaussian likelihood. Alternative methods for selection of discrete 
graphical models include the classical Chow-Liu algorithm for trees [9]; techniques based on 
conditional entropy or mutual information [2, 6]; and nodewise logistic regression for discrete 
graphical models with pairwise interactions [16, 27]. Our population-level results imply that 
minor variants of the graphical Lasso and neighborhood regression methods, though originally 
developed for Gaussian data, remain consistent for trees and the broader class of graphical 
models with singleton separator sets. They also convey a cautionary message, in that these 
methods will be inconsistent (generically) for other types of graphs. We also describe a new 
method for neighborhood selection in an arbitrary sparse graph, based on linear regression 
over subsets of variables. Although suitable only for bounded degree graphs, it handles the 
case of noisy or missing data in a seamless manner. 

The remainder of the paper is organized as follows: In Section 2, we provide brief back- 
ground and notation on graphical models and describe the classes of augmented covariance 
matrices we will consider. In Section 3, we state our main population-level result (Theorem 1) 
on the relationship between the support of generalized inverse covariance matrices and the 
edge structure of a discrete graphical model, and then develop a number of corollaries. The 
proof of Theorem 1 is provided in Section 3.4, with proofs of the more technical results deferred 
to the appendices. In Section 4, we develop consequences of our population-level results in 
the context of specific methods for graphical model selection. We provide simulation results 
in Section 4.4 in order to confirm the accuracy of our theoretically-predicted scaling laws, 
dictating how many samples are required (as a function of graph size and maximum degree) 
to recover the graph correctly. Section 5 is devoted to proofs of our sample-based results, 
again with more technical results appearing in the appendices. 



2 Background and problem setup 

In this section, we provide background on graphical models and exponential families. We then 
work through a simple example that illustrates the phenomena and methodology studied in 
this paper. 
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2.1 Undirected graphical models 

An undirected graphical model or Markov random field (MRF) is a family of probability 
distributions respecting the structure of a fixed graph. We begin with some basic graph- 
theoretic terminology. An undirected graph G = (V, E) consists of a collection of vertices 
V = {1, 2, . . . , p} and a collection of unordered 1 vertex pairs E C V x V. A vertex cutset is a 
subset U of vertices whose removal breaks the graph into two or more nonempty components; 
see Figure 1(a) for an illustration. A clique is a subset C C V such that (s,t) G E for all 
distinct s, t G C. Any singleton ^4 = {s} is trivially a clique, but it may not be maximal. 
The cliques shown in Figure 1(b) are all maximal, meaning they are not properly contained 
within any other clique. For s G V, we define the neighborhood N(s) := {t G V | (s, t) G E} 
to be the set of vertices connected to s by an edge. 

For a fixed undirected graph G, we associate to each vertex s G V a random variable X s , 
taking values in some space X. For any subset A C 1/, we define the convenient shorthand 

:= s £ A}, and for three subsets of vertices, A, B and U, we write Xa -LL | to 
mean that the random vector Xa is conditionally independent of Xb given Xjj. The notion 
of a Markov random field may be defined in two essentially equivalent ways: either in terms 
of certain Markov properties indexed by vertex cutsets, or in terms of a factorization property 
described by the graph cliques. 




Figure 1. (a) Illustration of a vertex cutset: when the set U is removed, the graph breaks into 
two disjoint subsets of vertices A and B. (b) Illustration of maximal cliques, corresponding to 
fully-connected subsets of vertices. 



Definition 1 (Markov property). We say that the random vector X := (X±, . . . ,X p ) is 
Markov with respect to the graph G if Xa -LL Xb \ Xjj whenever U is a vertex cutset that 
breaks the graph into disjoint subsets A and B. 

As an important special case, the neighborhood set N(s) is always a vertex cutset for A = {s} 
and B = V\{s U N(s)}. Consequently, whenever X is Markov with respect to G, we have 
the conditional independence property X s _LL %\{ sU jv(s)} I ^n(s)- This property plays an 
important role in neighborhood-based methods for graphical model selection that we will 
discuss later. 

The factorization property is defined directly in terms of the probability distribution q of 
the random vector X. For each clique C, a clique compatibility function ipc is a mapping 

1 No distinction is made between the edge (s,t) and the edge (t,s). In this paper, we forbid graphs with 
self-loops, meaning (s, s) $5 E for all s € V. 
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from configurations xq = {x s ,s G V} of variables to the positive reals. Let C denote the set 
of all cliques in G. 

Definition 2 (Factorization property). The distribution of X factorizes according to G if it 
may be represented as a product of clique compatibility functions: 

q(x u ...,x p ) oc Y[ 4>c(xc)- (1) 
cec 

Without loss of generality, the factorization may always be restricted to maximal cliques of 
the graph, but it may be convenient for interpretability to include terms for non-maximal 
cliques. 



2.2 Graphical models and exponential families 

By the Hammersley-Clifford theorem [5, 13, 19], the Markov and factorization properties are 
equivalent for any strictly positive distribution. We focus on such strictly positive distributions 
throughout this paper, in which case the factorization (1) may alternatively be represented in 
terms of an exponential family associated with the clique structure of G. We begin by defining 
this exponential family representation for the special case of binary variables (X = {0,1}), 
before discussing a natural generalization to m-ary discrete random variables. 



Binary variables: For a binary random vector X G {0, 1} P , we associate with each clique 
C — both maximal and non-maximal — a sufficient statistic lc( x c) := llsec x s- Our choice of 
notation reflects that fact that Ic(xc) = 1 if and only if x s = 1 for all s G C, so it is an 
indicator function for the event {x s = 1, Vs G C}. In the exponential family, this sufficient 
statistic is weighted by a natural parameter 8c G M, and we write the factorization (1) in the 
form 

q e (x 1 ,...,x p ) = exp{Y,8cMxc)-H0)}, (2) 

Cec 

where := log J2xe{o i}p ex P(SceC ®c^c{xc)) is the log normalization constant. It may 
be verified (cf. Lemma 1 below) that the factorization (2) defines a minimal exponential family, 
meaning the sufficient statistics {Ic(xc), C G C} are affinely independent. In the special case 
of interactions that are at most pairwise, equation (2) reduces to the classical Ising model: 

qejxi, . . . , x p ) = exp { }^ 9 s x s + ^ st x s x t - ®{0)}. (3) 
sev (s,t)eE 

The model (3) is a particular instance of a pairwise Markov random field. 



Multinomial variables: In order to generalize the Ising model to non-binary variables — 
say X = {0, 1, ...,m— 1} — we introduce a larger set of sufficient statistics. Let us first 
illustrate this extension in the special case of a pairwise Markov random field. For each node 
s G V and configuration j G Xq := Af\{0} = {1, 2, . . . , m — 1}, we introduce the binary-valued 
indicator function 

U otherwise. 
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We also introduce a vector 9 S = {O s j,j G Xq} of natural parameters associated with these suf- 
ficient statistics. Similarly, for each edge (s,t) G E and configuration (j,k) G A'q := Afo x Ao, 
we introduce the binary- valued indicator function l s t;jk for the event {x s = j, x t = k}, as well 
as the collection 9 s t '■= {9 s tjk,{j,k) G Xq} of natural parameters. With this notation, any 
pairwise Markov random field over m-ary random variables may be written in the form 

q e (xi,. . . ,x p ) = exp{ J2(6 S , I s {x s )) + ^ (9 st , I st (x s ,x t )) - $(9)} , (5) 
s£V (s,t)€E 

where we have introduced the convenient shorthands (9 S , I s (x s )) := Y^j=i @s;j^s;j(%s) and 
(9 s t, I s t(x s , x t )) := Yl, r j l k=i^st;jk^-st;jk{x s ,xt). As stated in Lemma 1 below, the factoriza- 
tion (5) defines a minimal exponential family with dimension equal to |V|(m — \) + \E\(m — l) 2 . 
Note that the family (5) is a natural generalization of the Ising model (3); in particular, for 
the special case m = 2, we have a single sufficient statistic I s; i(x s ) = x s for each vertex, and a 
single sufficient statistic I s t-u{x s , xt) = x s x t for each edge. (We have omitted the additional 
subscripts 1 or 11 in our earlier notation for the Ising model, since they are superfluous in 
that case.) 

Finally, for a graphical model involving higher-order interactions, we require additional 
sufficient statistics. For each clique C G C, we define the subset of configurations 

X l Cl := Xp X ■ ■ ■ X Xq = {( Js , S eC)£#l:j s /0 Vs G C}, 
C times 

a set of cardinality (m— 1)'^'. As before, C is the set of all maximal and non-maximal cliques. 
For any configuration J = {j s , s G C} G Xq , we define the corresponding indicator function 

I U otherwise. 

We then consider the general multinomial exponential family 

q g (x 1 ,...,x p ) =exp{ ^2(6 C , lc) -$(0)}, where x s G X = {0,1,... ,m- 1}, (7) 

Cec 

with (9c, Ic(xc)) = y\ c \ ®C;J^C;j{xc)- Note that all our previous models — namely the 

binary models (2) and (3), as well as the pairwise multinomial model (5) — are special cases 
of this general factorization. 

Recall that an exponential family is minimal if no nontrivial linear combination of sufficient 
statistics is almost surely equal to a constant. The family is regular if {9 : <&(9) < oo} is an 
open set. As will be relevant later, the exponential families described in this section are all 
minimal and regular. We summarize as follows: 

Lemma 1. The exponential family (7) is a minimal and regular family with dimension 
£ = £c 6 c(™-l) |C| - 

See Appendix A.l for the proof of Lemma 1. 
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2.3 Covariance matrices and beyond 



We now turn to a discussion of the phenomena that motivate the analysis of this paper. 
Consider the usual covariance matrix £ = cov(Xi, . . . , X p ). When X is jointly Gaussian, it is 
an immediate consequence of the Hammer sley- Clifford theorem that the sparsity pattern of the 
precision matrix T = S _1 reflects the graph structure — that is, T s t = whenever (s, t) ^ E. 
More precisely, Y st is a scalar multiple of the correlation of X s and X t conditioned on X\{ st j 
(cf. Lauritzen [19]). For non-Gaussian distributions, however, the conditional correlation will 
be a function X\ s s n , and it is unknown whether the entries of T have any relationship with 
the strengths of correlations along edges in the graph. 

Nonetheless, it is tempting to conjecture that inverse covariance matrices, or some variant 
thereof, might be related to graph structure in the non-Gaussian case. We will explore this 
possibility by considering a simple case of the binary Ising model (3). 



Example 1. Consider a simple chain graph on four nodes, as illustrated in Figure 1(a). In 
terms of the factorization (3), let the node potentials be 9 S = 0.1 for all s G V and the edge 
potentials be 9 s t = 2 for all (s, t) G E. For a multivariate Gaussian graphical model defined 
on G, standard theory predicts that the inverse covariance matrix T = S" 1 of the distribution 
is graph-structured: T st = if and only if (s, t) ^ E. Surprisingly, this is also the case for the 
chain graph with binary variables: a little computation shows that T takes the form shown 
in panel (f). However, this statement is not true for the single-cycle graph shown in panel 
(b). Indeed, as shown in panel (g), the inverse covariance matrix has no nonzero entries at 
all. Curiously, for the more complicated graph in (e), we again observe a graph-structured 
inverse covariance matrix. 
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-3.59 9.80 



■ loop 



51.37 -5.37 -0.17 -5.37 

-5.37 51.37 -5.37 -0.17 
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Figure 2. (a)-(e) Different examples of graphical models, (f) Inverse covariance for chain 
graph in (a), (g) Inverse covariance for single-cycle graph in (b). 



Still focusing on the single-cycle graph in panel (b), suppose that instead of considering 
the ordinary covariance matrix, we compute the covariance matrix of the augmented random 
vector (Xi, X2, X3, X4, X1X3), where the extra term X1X3 is represented by the dotted edge 
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shown in panel (c) . The 5x5 inverse of this generalized covariance matrix takes the form 
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(8) 



This matrix safely separates nodes 1 and 4, but the entry corresponding to the phantom edge 
(1,3) is not equal to zero. Indeed, we would observe a similar phenomenon if we chose to 
augment the graph by including the edge (2,4) rather than (1,3). Note that the relationship 
between entries of T aug and the edge strength is not direct; although the factorization (3) has 
no potential corresponding to the augmented "edge" (1, 3), the (1, 3) entry of T aug is noticeably 
larger in magnitude than the entries corresponding to actual edges with nonzero potentials. 
This example shows that the usual inverse covariance matrix is not always graph-structured, 
but computing generalized covariance matrices involving higher-order interaction terms may 
indicate graph structure. 

Now let us consider a more general graphical model that adds the 3-clique interaction 
terms shown in panel (d) to the usual Ising terms. We compute the covariance matrix of the 
augmented vector 

^(X) = {Xi,X2,X3, Xa,XiX2, X2X3, X3X4, X1X4, X1X3, X1X2X3, X1X3X4} G {0, l} 11 . 

Empirically, we find that the 11 x 11 inverse of the matrix cov(*S>(X)) continues to respect 
aspects of the graph structure: in particular, there are zeros in position (a,/3), corresponding 
to the associated functions X a = Y\ sGa X s and Xp = \\ s& p Xp, whenever a and /3 do not lie 
within the same maximal clique. (For instance, this applies to the pairs (a, (3) = ({2}, {4}) 
and («,/3) = ({2},{l,4}).) 



The goal of this paper is to understand when certain inverse covariances do (and do not) 
capture the structure of a graphical model. The underlying principles behind the behavior 
demonstrated in Example 1 will be made concrete in Theorem 1 and its corollaries in the next 
section. 



3 Generalized covariance matrices and graph structure 

We now state our main results on the relationship between the zero pattern of generalized 
inverse covariance matrices and graph structure. In Section 4 to follow, we develop some 
consequences of these results for data-dependent estimators used in structure estimation. 

We begin with some notation for defining generalized covariance matrices, which are de- 
fined in terms of the sufficient statistics previously defined (6). Recall that any clique C G C 
is associated with the collection {Ic ; j, J £ Ajj } of binary- valued sufficient statistics. Let 
S C C, and define the random vector 

v(X;S) = {l C ;j,Je4 Cl i CeS}, (9) 

consisting of all the sufficient statistics indexed by elements of S. As in the previous section, 
C contains both maximal and non-maximal cliques. 
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We will often be interested in situations where 5 contains all subsets of a given set. For 
a subset A QV, let pow(A) denote the collection of all 2'^' — 1 nonempty subsets of A. We 
extend this notation to S by defining 

pow(5) := |^J pow(C). 
Ce-S 

3.1 Triangulation and block structure 

Our first main result concerns a connection between the inverses of generalized inverse covari- 
ance matrices associated with the model (7) and any triangulation of the underlying graph 
G. The notion of a triangulation is defined in terms of chordless cycles, which are sequences 
of distinct vertices {s±, . . . , s^} such that: 

• (sj, Sj+i) £ E for all 1 < i < £ — 1, and also (sg, s\) € E; 

• no other nodes in the cycle are connected by an edge. 

As an illustration, the 4-cycle in Figure 1(b) is a chordless cycle, 

Definition 3 (Triangulation). Given an undirected graph G = (V,E), a triangulation is an 
augmented graph G = (V, E) that contains no chordless cycles of length greater than 3. 

Note that any tree is trivially triangulated, since it contains no cycles. On the other hand, 
the chordless 4-cycle in Figure 1(b) is the simplest example of a non-triangulated graph. By 
adding the single edge (1,3) to form the augmented edge set E = E U {(1, 3)}, we obtain the 
triangulated graph G = (V, E) shown in panel (c). One may check that the more complicated 
graph shown in Figure 1(e) is triangulated, as well. 

Our first result applies to the inverse T of the covariance matrix cov(^(X; C)), where C is 
the set of all cliques arising from some triangulation G of G. (Our theory guarantees that the 
inverse T exists). For any two subsets A, B G C, we write T(A,B) to denote the sub-block 
of r indexed by all indicator statistics on A and B, respectively. (Note that we are working 
with respect to the exponential family representation over the triangulated graph G.) Given 
our previously-defined sufficient statistics (6), the sub-block T(A,B) has dimensions cLa x ds, 
where 

d A :=(m-l)\ A \ and d B := (m - 1) |B| . 

As a particular example, when A = {s} and B = {t}, the submatrix T(A, B) has dimension 
(m — 1) x (m — 1). With this notation, we have the following result: 

Theorem 1. [Triangulation and block graph-structure.] Consider an arbitrary discrete graph- 
ical model of the form (7), and let C be the set of all cliques in any triangulation of G. Then 
the generalized covariance matrix cov(^(X; C)) is invertible, and its inverse T is block graph- 
structured in the following sense: 

(a) For any two subsets A,B € C that are not subsets of the same maximal clique, the block 
T(A, B) is identically zero. 

(b) For almost all parameters 6, the entire block T(A,B) is nonzero whenever A and B 
belong to a common maximal clique. 
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In part (b), "almost all" refers to all parameters 9 apart from a set of Lebesgue measure 
zero. The proof of Theorem 1, which we provide in Section 3.4, relies on the geometry of expo- 
nential families [7, 31] and certain aspects of convex analysis [28], involving the log partition 
function <I> from equation (35) and its Fenchel-Legendre dual <£*. Although we have stated 
Theorem 1 for discrete variables, it is actually a more general result that holds for any class 
of random variables. The only difference is the specific choices of sufficient statistics used to 
define the generalized covariance matrix. This generality becomes apparent in the proof. 

To provide intuition for Theorem 1, let us describe the predictions it makes for specific 
graphs. Note that when the original graph G is a tree (such as the graph in Figure 1(a)), G 
is already triangulated, so the set C in Theorem 1 is equal to the edge set E, together with 
singleton nodes. Hence, Theorem 1 implies that the inverse T of the augmented covariance 
matrix with sufficient statistics for all vertices and edges is graph-structured, and blocks of 
nonzeros in T correspond to edges in the graph. In particular, we may apply Theorem 1(a) 
to the subsets A = {s} and B = {t}, where s and t are distinct vertices with (s,t) ^ E, and 
conclude that the (m — 1) x (m — 1) sub-block T(A, B) is equal to zero. 

When G is not triangulated, however, we may need to invert a larger augmented covari- 
ance matrix and include sufficient statistics over pairs (s, t) ^ E, as well. For instance, the 
augmented graph shown in Figure 1(c) is a triangulation of the chordless 4-cycle in panel 
(b). The associated set of maximal cliques is given by C = {(1,2), (2,3), (3,4), (1,4), (1,3)}; 
among other predictions, our theory guarantees that the generalized inverse covariance T will 
have zeros in the sub-block r({2}, {4}). 

3.2 Separator sets and graph structure 

In fact, it is not necessary to take the set of sufficient statistics over all maximal cliques, and 
we may consider a slightly smaller augmented covariance matrix. (It is this simpler type of 
augmented covariance matrix that explains the calculations given in Section 2.3.) In order to 
describe this simplification, we require the notion of a junction tree. 

By classical graph theory, any triangulation G gives rise to a junction tree representation 
of G. Nodes in the junction tree are subsets of V corresponding to maximal cliques of G, 
and the intersection of any two adjacent cliques C\ and C2 is referred to as a separator set 
S = C\ n C2. Furthermore, any junction tree must satisfy the running intersection property, 
meaning that for any two nodes of the junction tree — say corresponding to cliques C and 
D — the intersection C DD must belong to every separator set on the unique path between C 
and D. 

The following result shows that it suffices to construct generalized covariance matrices 
augmented by separator sets: 

Corollary 1. Let S be the set of separator sets in any triangulation of G, and let T be the 
inverse of cov(^(X; V U pow(<S))). Then r({s}, {t}) = whenever (s,t) £ E. 

Note that V U pow(5) C C, and the set of sufficient statistics considered in Corollary 1 is 
generally much smaller than the set of sufficient statistics considered in Theorem 1. Hence, 
the generalized covariance matrix of Corollary 1 has a smaller dimension than the generalized 
covariance matrix of Theorem 1 , which becomes significant when we consider exploiting these 
population- level results for statistical estimation. 

The graph in Figure 1(c) of Example 1 and the associated matrix in equation (8) provide 
a concrete example of Corollary 1 in action. In this case, the single separator set in the 
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triangulation is {1,3}, so when X = {0,1}, augmenting the usual covariance matrix with 
the additional sufficient statistic Ii3 ; ii(cci, x%) = X1X3 and taking the inverse yields a graph- 
structured matrix. Indeed, since (2,4) ^ E, we observe that r aug (2,4) = in equation (8), 
consistent with the result of Corollary 1. 

Although Theorem 1 and Corollary 1 are clean population-level results, however, forming 
an appropriate augmented covariance matrix requires prior knowledge of the graph — namely, 
which edges are involved in a suitable triangulation. This is infeasible in settings where the 
goal is to recover the edge structure of the graph. Corollary 1 is most useful for edge recovery 
when G admits a triangulation with only singleton separator sets, since then VL)pow(S) = V. 
In particular, this condition holds when G is a tree. The following corollary summarizes our 
result: 

Corollary 2. For any graph with singleton separator sets, the inverse V of the covariance 
matrix cov(^(X; V)) of vertex statistics is graph-structured. (This class includes trees as a 
special case.) 

In the special case of binary variables, we have $?(X;V) = (X±, . . . , X p ), so Corollary 2 
implies that the inverse of the ordinary covariance matrix cov(X) is graph-structured. For 
m-ary variables, cov(^(X; V)) is an (m — l)p x (m — l)p matrix involving indicator functions 
for each variable. Again, we may relate this corollary to Example 1 — the inverse covariance 
matrices for the tree graph in panel (a) and the dinosaur graph in panel (e) are exactly graph- 
structured. Indeed, although the dinosaur graph is not a tree, it possesses the nice property 
that the only separator sets in its junction tree are singletons. 

Corollary 1 also guarantees that inverse covariances may be partially graph-structured, in 
the sense that T({s}, {t}) = for any pair of vertices (s, t) separable by a singleton separator 
set, where F = (cov(\I/(X; V))) . This is because for any such pair (s,t), we may form a 
junction tree with two nodes, one containing s and one containing t, and apply Corollary 1. 
Indeed, the matrix T defined over singleton vertices is agnostic to which triangulation we 
choose for the graph. 

In settings where there exists a junction tree representation of the graph with only singleton 
separator sets, Corollary 2 has a number of useful implications for the consistency of methods 
that have traditionally only been applied for edge recovery in Gaussian graphical models. 
Indeed, Corollary 2 implies that for tree-structured discrete graphs, it suffices to estimate the 
support of (cov(^(X; V)))" 1 from the data. We will review methods for selection in Gaussian 
graphical models and describe their analogs to the case of discrete tree graphs in Sections 4.1 
and 4.2 to follow. 

3.3 Generalized covariances and neighborhood structure 

Theorem 1 also has a corollary that is relevant for nodewise neighborhood selection approaches 
to graph selection [23, 26] that are applicable to graphs with arbitrary topologies. Nodewise 
methods use the basic observation that recovering the edge structure of G is equivalent to 
recovering the neighborhood set N(s) = {t E V : (s,t) E E} for each vertex s E V. For a 
given node s E V and positive integer d, consider the collection of subsets 

S(s;d):={UOV\{s}, \U\=d}. 

The following corollary provides an avenue for recovering N(s) based on the inverse of a 
certain generalized covariance matrix: 
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Corollary 3. [Neighborhood selection] For any graph and any node s € V with degree at 
most d, the inverse T of the covariance matrix cov(^>(X; {s}Upow(5(s; of)))) is s-block graph- 
structured, meaning r({s},-B) = whenever {s} / BC N(s). In particular, r({s},{i}) = 
for all vertices t N(s). 

Note that pow(5(s; d)) is the set of subsets of all candidate neighborhoods of s of size d. 
This result follows from Theorem 1 (and the related Corollary 1) by constructing a particular 
junction tree for the graph, in which s is separated from the rest of the graph by N(s). Due 
to the well-known relationship between the rows of an inverse covariance matrix and linear 
regression coefficients (cf. [23] and Lemma 6 in the Appendix), Corollary 3 motivates the 
following neighborhood-based approach to graph selection: For a fixed vertex s £ V, perform 
a single linear regression of \I/(X;{s}) on the vector ^f(X;pow(S(s;d))). Via elementary 
algebra and an application of Corollary 3, the resulting regression vector will expose the 
neighborhood N(s) in an arbitrary discrete graphical model, in the sense that the indicators 
fy(X; {t}) corresponding to Xt will have a nonzero weight only if t € N(s). We elaborate on 
this connection in Section 4.2. 

3.4 Proof of Theorem 1 

Our proof is based on certain fundamental correspondences arising from the theory of expo- 
nential families [4, 7, 31]. Recall that our exponential family (7) has binary- valued indicator 
functions (6) as its sufficient statistics. Let D denote the cardinality of this set. In order 
to ease notation, we let I : X v — > {0, i} D denote the multivariate function that maps each 
configuration x 6 X p to the vector I(x) obtained by evaluating each of these D binary-valued 
indicator functions on x. Using this notation, our exponential family may be written in the 
compact form qg{x) = exp{(6>, I(x)) — <£(#)}, where 

(0, 1(x)> = J>o, Ic(*)> = E E e c,M,j{x c ). 

cec cec JeX \c\ 

From Lemma 1, we know that our exponential family is minimal. Since the domain of $ 
is all of R D , the following properties of $ follow from standard results [7, 31]: 

Lemma 2. The function <I> : M. D — > M is strictly convex and infinitely differentiable on M. D . 
Moreover, its derivatives correspond to cumulants, with 

V$(0) =E<j[I(X)] and V 2 $(#) = cav e {I{X)). (10) 

Here, Eq (respectively, covg) denotes the expectation (respectively, covariance) taken under 
the density qg. 

Our proof leverages the correspondence between $ and its conjugate dual function [28], 
defined in a variational manner as 

:= sup{(^, 0>-$(0)}. 

The function $* is always convex and takes values in M U {+00} . From known results for 
exponential families [31], the function <3?* is finite only for vectors /i E M> D belonging to the 
marginal polytope 

M := {/i G R p I 3 some density q s.t. ^^q(x)I(x) = fj,}. (11) 

X 
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In the case of discrete variables, this set is a polytope, because it is defined as the convex hull 
of the finite collection of vectors {I(x),x S X p }', see Wainwright and Jordan [31] for more 
details. 

The following lemma, proved in Appendix A. 2, provides a connection between the covari- 
ance matrix and the Hessian of 

Lemma 3. Consider a regular, minimal exponential family, and define fi = Eg [1(A)] for any 
fixed 9 e tt = {9 : $(0) < oo}. Then we have 

(cov fl {I(X)})- 1 =V 2 $*(/i). (12) 

Note that the minimality and regularity of the family implies that covg{I(X)} is strictly 
positive definite, so the matrix is invertible. 

Lemma 3 is the key to our proof: it relates the sparsity pattern of (cove{I(A)}) _1 to the 
conjugate dual. Accordingly, we now turn to a more explicit consideration of this conjugate 
dual <&*. As noted previously, 3>*(/i) < oo for all [i £ Ai. From Lemma 2, the gradient 
mapping V$ maps natural parameters 9 S MP to mean parameters fi = V<&(#) = ¥,g [1(A)] 
belonging to the interior int(TW). Moreover, for the minimal exponential family under con- 
sideration here, the pair ($, $*) is of the Legendre type [28], which ensures that the gradient 
mapping is invertible on int(A'f). 

For any fi G int(A^), let 9(/i) S M D denote the unique natural parameter 9 such that 
V<£(#) = [i. With this notation, it can be shown [31] that the (negative) dual value — 3>*(/i) 
is equal to the Shannon entropy of the distribution qg^y- 

-$*(//) = H(q e{fl) (x)) = - QeM(x)logq e{p) (x). (13) 

In general, this expression does not provide a straightforward avenue to computing V 2 ^*, 
since the mapping fj, \— > 9((i) may be extremely complicated. 

Distributions defined by triangulated graphs are an important exception to this rule; the 
triangulation condition in Theorem 1 guarantees that the conjugate dual function <I>* has an 
explicit closed- form representation in terms of the mean parameters /i. This relationship is 
made explicit via the junction tree theorem, as we now demonstrate. Given a junction tree, 
let (C, S), be the collection of maximal cliques and separator sets, respectively. From the 
junction tree theorem [18, 31, 17], the distribution q = qe^) may be factorized in the form 

q{x u ...,x p )= CeC 14 
Uses Qs(x S ) 

where qc and qs are the marginal distributions over maximal clique C and separator set S, 
respectively. Consequently, the entropy may be decomposed into the sum 

H (?) =-J2 lo § ?(*) = E H ^c) ~ E H («s), (15) 
xexp Cg( 7 ses 

where we have introduced the clique- and separator-based entropies 

Hs(qs) ■= ~ Y 1s(x s ) log qs(x s ) and H c (qc) ■= ~ ^ qc{x c ) log qc{xc)- (16) 
x s ex\ s \ x c &x\ c \ 
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Given our choice of sufficient statistics (6), we now show that the marginals qc and qs 
may be written explicitly as "local" functions of mean parameters associated with C and S, 
respectively. For each subset A (IV, let fiA G ( m — I)'" 4 ' be the associated collection of mean 
parameters, and define 

Mpow(A) ■= {VB :tt^BCA} 

to be the set of mean parameters associated with all nonempty subsets of A. Note that /U pow ( J 4) 

contains a total of ^fcHi ('fc')( m — -0* = — 1 parameters, corresponding to the number 
of degrees of freedom involved in specifying a marginal distribution over the random vector 
xa, with each variable taking m possible values. Moreover, we claim that fJ> pow (A) determines 
one and exactly one such marginal distribution qA- 

Lemma 4. For any marginal distribution qA in the mM'-dimensional probability simplex, 
there is a unique mean parameter vector /i pow ( A) and matrix Ma such that qA = Ma (fJ- pc ,w(A) ) • 

For example, in the case of a singleton A = {s}, we may view q s as an m- vector of numbers, 
and write 

Qs = [1 -Z)jLiMs;i) ■ ■ ■ , fJ>s;m] , (17) 

where fx s; j := Ee[I s; j(X)] = qg(x s = j) for each j = 1, 2, . . . , m — 1. Similarly, for any edge 
A = {s,t}, we may view q s j as a vectorized version of the matrix 

r -1 v— im- 1 v— \m— 1 -i 
1 — 2~,j,k=l ^st jk Mt;l — 2-fj=l Msty'l ' ' ' ^t;m ~ Z^j=l Ustjm 

Em— I 
k=l ^st;lk Msi;ll ' ' ' A*st;lm 

Em— 1 
fe=l /^st^fc ^st;21 - - - Mst;2m 

Em— 1 
fc = l Hst\mk l^st;m,l f^st;mm 

In Appendix A. 3, we prove Lemma 4 via an iterative extension of this basic procedure. 

We now combine the dual representation (13) with the decomposition of the entropy (15), 
along with the matrices {Ale, Ms} guaranteed by Lemma 4, to conclude that 

= Hc(M c (» pawi o)) ~ Yl Hs(M s ^ pow{s) )). (18) 

This expression for the dual function <&* is explicit enough to verify the claims in Theorem 1. 

Consider two subsets A,B G C that are not contained in the same maximal clique. Suppose 
A is contained within maximal clique C. Differentiating expression (18) with respect to fiA 
preserves only terms involving qc and qs, where S is any separator set such that A C S C C. 
Since B C C, we clearly cannot have B C. S. Consequently, all cross-terms arising from 
the clique C and its associated separator sets vanish when we take a second derivative with 
respect to \xb- We may repeat this argument for any other maximal clique C containing A 
but not B, thereby concluding that J^IL (/x) = 0. This proves part (a) of the theorem. 

Turning to part (b) of Theorem 1, note that if A and B are part of the same maximal 
clique, the expression obtained by taking second derivatives of the entropy results in an 
algebraic expression with only finitely many solutions in the parameters fi (consequently, also 
9). Hence, assuming the 9 , s are drawn from a continuous distribution, the corresponding 
values of the block r(^4, B) are a.s. nonzero. 

We prove Corollaries 1 through 3 in Appendix B. 
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4 Consequences for graph structure estimation 



Moving beyond the population level, we now state and prove several results concerning the 
statistical consistency of different methods — both known and some novel — for graph selection 
in discrete graphical models, based on i.i.d. draws from a discrete graph. For sparse Gaussian 
models, existing methods that exploit sparsity of the inverse covariance matrix fall into two 
main categories: global graph selection methods (e.g., [11, 12, 29, 26]) and local (nodewise) 
neighborhood selection methods [23, 34]. We divide our discussion accordingly. 

4.1 Graphical Lasso for singleton separator graphs 

We begin by describing how a combination of our population-level results and some concen- 
tration inequalities may be leveraged to analyze the statistical behavior of log-determinant 
methods for discrete graphical models with singleton separator sets, and suggest extensions 
of these methods when observations are systematically corrupted by noise or missing data. 
Given a p-dimensional random vector (X%, . . . , X p ) with covariance £*, consider the estimator 

G argmin{trace(£0) -logdet(G) + A n ^|0 si |}, (19) 

where £ is an estimator for £*. For multivariate Gaussian data, this program is an £±- 
regularized maximum likelihood estimate known as the graphical Lasso and is a well-studied 
method for recovering the edge structure in a Gaussian graphical model [3, 12, 33, 29]. Al- 
though the program (19) has no relation to the MLE in the case of a discrete graphical model, 
it might still be useful for estimating 0* := (S*)" 1 . Indeed, as shown in Ravikumar et al. [26], 
existing analyses of the estimator (19) require only tail conditions such as sub-Gaussianity 
in order to guarantee that the sample minimizer is close to the population minimizer. The 
analysis of this paper completes the missing link by guaranteeing that the population-level 
inverse covariance is in fact graph-structured. Consequently, we obtain the remarkable result 
that the program (19) — even though it is ostensibly derived from Gaussian considerations — is 
a consistent method for recovering the structure of any binary graphical model with singleton 
separator sets. 

In order to state our conclusion more precisely, let us introduce some additional notation. 
We consider a general estimate £ of the covariance matrix £ such that 



P[||£ - £*|| max > ^(£*)^^p] < cexp(-^(n,p)) (20) 

for functions ip and ip, where || • || max denotes the elementwise ^-norm. In the case of fully- 
observed i.i.d. data with sub-Gaussian parameter a 2 , where £ = - Yli=i x i x I ~ 1S the 
usual sample covariance, this bound holds with </?(£*) = o~ 2 and ip(n,p) = c'logp. 

As in past analysis of the graphical Lasso [26], we require a certain mutual incoherence 
condition on the true covariance matrix £* to control the correlation of non-edge variables 
with edge variables in the graph. Let T* = £* ® £*, where <S> denotes the Kronecker product. 
Then T* is a p 2 x p 2 matrix indexed by vertex pairs. The incoherence condition is given by 

max llr^r^)" 1 !!! < 1 - a, a G (0, 1], (21) 

e£i> c 

where S := {(s,t) : Q* st ^ 0} is the set of vertex pairs corresponding to nonzero entries of the 
precision matrix 0* — equivalently, the edge set of the graph, by our theory on tree-structured 
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discrete graphs. For more intuition on the mutual incoherence condition, see Ravikumar et 
al. [26]. 

With this notation, our global edge recovery algorithm proceeds as follows: 



Algorithm 1 (Graphical Lasso). 

1. Form a suitable estimate £ of the true covariance matrix S. 

2. Optimize the graphical Lasso program (19) with parameter A n , and denote the solution 



3. Threshold the entries of O at level r n to obtain an estimate of O*. 

It remains to choose the parameters (A n ,r n ). In the following corollary, we will establish 
statistical consistency of under the following settings: 



where a is the incoherence parameter in inequality (21) and c\,C2 are universal positive 
constants. The following result applies to Algorithm 1 with S equal to the sample covariance 
matrix and (A n ,r n ) chosen as in equations (22): 

Corollary 4. Consider an Ising model (3) defined by an undirected graph with singleton 
separator sets and with degree at most d, and suppose that the mutual incoherence condi- 
tion (21) holds. With n ^ d 2 logp samples, there are universal constants (c, c') such that with 
probability at least 1 — cexp(— d logp), Algorithm 1 recovers all edges (s,t) with |0^| > t/2. 

The proof is contained in Appendix D.l; it is a relatively straightforward consequence 
of Corollary 1 and known concentration properties of £ as an estimate of the population 
covariance matrix. Hence, if |0^| > r/2 for all edges (s,t) E E, Corollary 4 guarantees that 
the log-determinant method plus thresholding recovers the full graph exactly. 

In the case of the standard sample covariance matrix, a variant of the graphical Lasso has 
been implemented by Banerjee et al. [3]. Our analysis establishes consistency of the graphical 
Lasso for Ising models on single separator graphs using n ^ d 2 \ogp samples. This lower 
bound on the sample size is unavoidable, as shown by information-theoretic analysis [30], 
and also appears in other past work on Ising models [27, 16, 2]. Our analysis also has a 
cautionary message: the proof of Corollary 4 relies heavily on the population-level result in 
Corollary 2, which ensures that 0* is graph-structured when G has only singleton separators. 
For a general graph, we have no guarantees that 0* will be graph-structured (e.g., see panel 
(b) in Figure 1), so the graphical Lasso (19) is inconsistent in general. 

On the positive side, if we restrict ourselves to tree-structured graphs, the estimator (19) 
is attractive, since it relies only on an estimate £ of the population covariance £* that satisfies 
the deviation condition (20). In particular, even when the samples {xj}" =1 are contaminated 
by noise or missing data, we may form a good estimate X of £*. Furthermore, the program (19) 
is always convex regardless of whether £ is positive semidefinite (as may not be the case for 
missing/corrupted data). 



by©. 




(22) 
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As a concrete example of how we may correct the program (19) to handle corrupted data, 
consider the case when each entry of X{ is missing independently with probability p, and the 
corresponding observations Zi are zero-filled for missing entries. A natural estimator is 

S = (l E * M ~ JTTpJ^' (23) 

where -j- denotes elementwise division by the matrix M with diagonal entries (1 — p) and 
off-diagonal entries (1 — p) 2 , correcting for the bias in both the mean and second moment 
terms. The deviation condition (20) may be shown to hold w.h.p., where (p(T,*) scales with 
(1 — p) (cf. Loh and Wainwright [22]). Similarly, we may derive an appropriate estimator X 
and a corresponding version of Algorithm 1 in situations when the data are systematically 
contaminated by other forms of additive or multiplicative corruption. 

Generalizing to the case of m-ary discrete graphical models with m > 2, we may easily 
modify the program (19) by replacing the elementwise £i-penalty by the corresponding group 
^-penalty, where the groups are the indicator variables for a given vertex. Precise theoretical 
guarantees follow from results on the group graphical Lasso [15]. 



4.2 Consequences for nodewise regression in trees 

Turning to local neighborhood selection methods, let us recall the neighborhood-based method 
introduced by Meinshausen and Biihulmann [23] . In a Gaussian graphical model, the column 
corresponding to node s in the inverse covariance matrix T = S _1 is a scalar multiple of 
j3 = £r\ £\ S)S , the limit of the linear regression vector for X s upon X\ s . Based on n 
i.i.d. samples from a p-dimensional multivariate Gaussian distribution, the support of the 
graph may then be estimated consistently under the usual Lasso scaling n £3 dlogp, where 
d=\N{s)\. 

Motivated by our population-level results on the graph structure of the inverse covariance 
matrix (Corollary 2), we now propose a method for neighborhood selection in a tree-structured 
graph. Although the method works for arbitrary m-ary trees, we state explicit results only in 
the case of the binary Ising model to avoid cluttering our presentation. 

The method is based on the following steps. For each node s £ V, we first perform 
.^-regularized linear regression of X s against X\ s by solving the modified Lasso program 

^Garg min {Vf/3 - 7 T /3 + A„||/3||i}, ( 24 ) 

where bo > ||/3||i is a constant, (r,7) are suitable estimators for (E\ s \ s , S\ S|S ), and A n is an 
appropriate parameter. We then combine the neighborhood estimates over all nodes via an 
AND operation (edge (s,t) is present if both s and t are inferred to be neighbors of each 
other) or OR operation (at least one of s or t is inferred to be a neighbor of the other). 

Note that the program (24) differs from the standard Lasso in the form of the ^-constraint . 
Indeed, the normal setting of the Lasso assumes a linear model where the predictor and 
response variables are linked by independent sub-Gaussian noise, but this is not the case for 
X s and X\ s in a discrete graphical model. Furthermore, the generality of the program (24) 
allows it to be easily modified to handle corrupted variables via an appropriate choice of 
(r,7), as in Loh and Wainwright [22]. 

The following algorithm summarizes our nodewise regression procedure for recovering the 
neighborhood set N(s) of a given node s: 
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Algorithm 2 (Nodewise method for trees). 

1. Form a suitable pair of estimators (r, 7) for covariance submatrices (E\ s \ s , E\ s>s ). 

2. Optimize the modified Lasso program (24) with parameter A n , and denote the solution 
by^. 

3. Threshold the entries of f3 at level r n , and define the estimated neighborhood set N{s) 
as the support of the thresholded vector. 

In the case of fully-observed i.i.d. observations, we choose (r, 7) to be the recentered 
estimators 

,~ „ f X T X rp X T y \ , . 

r, 7 = xx T , --yx), 25 

\ n n J 

and assign the parameters (A n ,r n ) according to the scaling 

^£<p\\M*f^, r n ^p\\P\\ 2 ^^, (26) 

where {3 := E" 1 Cov(xj, yi) and <p is some parameter such that (xi, u) is sub-Gaussian with 
parameter v^lMli f° r an y d-sparse vector u, and ip is independent of u. The following result 
applies to Algorithm 2 using the pairs (T, 7) and (A n , r n ) defined as in equations (25) and (26), 
respectively. 

Proposition 1. Suppose we have i.i.d. observations {(xi,yi)}f = i generated from an Ising 
model. If n £3 (p 2 max | A y ll^i 1 |||L} ^ ^°§P' then there are universal constants (c, c' , c") 
such that with probability at least 1 — cexp(— d logp), Algorithm 2 recovers all neighbors 
t G N(s) for which |A| > c" ^\\ 2 ^^. 

We prove this proposition in Section 5, as a corollary of a more general theorem on the 
•^oo-consistency of the program (24) for estimating /3, allowing for corrupted observations. The 
theorem builds upon the analysis of Loh and Wainwright [22], introducing techniques for f.^- 
bounds and departing from the framework of a linear model with independent sub-Gaussian 
noise. 

Remarks. Regarding the sub-Gaussian parameter ip appearing in Proposition 1, note that we 
may always take ip = y/d, since \xju\ < ||u||i < Va||tt|| 2 when u is ci-sparse and x% is a binary 
vector. This leads to a sample complexity requirement of n £3 <i 3 logp, which is comparable 
to the scaling required by Ravikumar et al. [27] for methods based on logistic regression. 
However, we suspect that a tighter analysis, possibly combined with further assumptions 
about the correlation decay of the graph, would reduce the sample complexity to the order 
of n ^ (Plogp required by other authors [16, 2], albeit under stronger assumptions. See the 
simulations in Section 4.4 for further discussion. 

Finally, note that for corrupted observations, the strength and type of corruption en- 
ters into the factors (pi, ^2) appearing in the deviation bounds (32a) and (32b) below, and 
Proposition 1 has natural extensions to the corrupted case. 
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In the case of m-ary tree-structured graphical models with m > 2, we may perform mul- 
tivariate regression with the multivariate group Lasso [25] for neighborhood selection, where 
groups are defined (as in the log-determinant method) as sets of indicators for each node. The 
general relationship between the best linear predictor and the block structure of the inverse 
covariance matrix is provided in Lemma 6 in Section 5. Hence, from a population-level per- 
spective, it suffices to perform a multivariate linear regression of all indicators corresponding 
to a given node against all indicators corresponding to other nodes in the graph. The resulting 
vector of regression coefficients then has nonzero blocks corresponding to edges in the graph. 
We may also combine these ideas with the group Lasso for multivariate regression [25] to 
reduce the complexity of the algorithm. 

4.3 Consequences for nodewise regression in general graphs 

Moving on from tree-structured graphical models, our method suggests a graph recovery 
method based on nodewise linear regression for general discrete graphs. Note that by Corol- 
lary 3, the inverse of cov(\P(X; pow(5(s; d)))) is s-block graph-structured, where d is such 
that |-ZV(s) | < d. Using the same idea based on Lemma 6 as in the case of nodewise regression 
for trees, it suffices to perform a single multivariate regression of the indicators *&(X; {s}) 
corresponding to node s upon the other indicators in ^(X; V U pow(5(s; d))). 

We again make precise statements for the binary Ising model (m = 2). In this case, the 
indicators $?(X; pow(f/)) corresponding to a subset of vertices U of size d' are all 2 d ' - 1 
distinct products of variables X u , for u € U. Hence, to recover the d neighbors of node s, 
we use the following algorithm. Note that knowledge of an upper-bound d is necessary for 
applying the algorithm. 

Algorithm 3 (Nodewise method for general graphs). 

1. Use the modified Lasso program (24) with a suitable choice of (r,7) and regulariza- 
tion parameter X n to perform a linear regression of X s upon all products of subsets of 
variables of X\ s of size at most d. Denote the solution by (3. 

2. Threshold the entries of (3 at level r n , and define the estimated neighborhood set N(s) 
as the support of the thresholded vector. 

Our theory states that at the population level, the nonzeros in the linear regression vector 
correspond exactly to subsets of N(s). Hence, the statistical consistency result of Proposition 1 
carries over with minor modifications. Since Algorithm 3 is essentially a version of Algorithm 4 
with the first two steps omitted, we refer the reader to the statement and proof of Corollary 5 
below for precise mathematical statements. Note here that since the regression vector has 
0(p d ) components, 2 d — 1 of which are nonzero, the sample complexity of Lasso regression in 
step (1) of Algorithm 3 is 0(2 d \og{p d )) = 0(2 d • logp). 

For graphs exhibiting correlation decay [6], we may reduce the computational complexity 
of the nodewise selection algorithm by prescreening the nodes of V\s before performing a 
Lasso-based linear regression. We define the nodewise correlation according to 

r c (s,t) := {Pi** = x„Xt = xt)-P(X 3 = x s )F(X t = x t )\, 

and say that the graph exhibits correlation decay if there exist constants £, k > such that 
r c (s,t) > k V(s, t) G E, and r c (s, t) < exp(-Cr(s, t)) V(s,i)eUxU, (27) 
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where r(s,t) is the length of the shortest path between s and t. With this notation, we then 
have the following algorithm for neighborhood recovery of a fixed node s in a graph with 
correlation decay: 



Algorithm 4 (Nodewise method with correlation decay). 

1. Compute the empirical correlations 

? c (s,t) := $(Xs = x s ,X t = x t )-P(X s = x s f(X t = x t )\ 

between s and all other nodes t S V, where P denotes the empirical distribution. 

2. Let C := {t G V : rc(s,t) > k/2} be the candidate set of nodes with sufficiently high 
correlation. (Note that C is a function of both s and k, and by definition, s £ C.) 

3. Use the modified Lasso program (24) with parameter A n to perform a linear regression 
of X s against Q := *f?(X; V U pow(C(s; d)))\{X s }, the set of all products of subsets of 
variables {X c : c € C} of size at most d, together with the singleton variables. Denote 
the solution by f3. 

4. Threshold the entries of (3 at level r n , and define the estimated neighborhood set N(s) 
as the support of the thresholded vector. 

Note that we may view Algorithm 3 as a version of Algorithm 4 with C = V\s, indicating the 
absence of a prescreening step. Hence, the statistical consistency result below applies easily 
to Algorithm 3 for graphs with no correlation decay. 

For fully-observed i.i.d. observations, we choose (r,7) according to 

(f , 7) = ~ ^\ ^ - yxc) , (28) 

\ n n J 

and the parameters (A n ,r n ) as follows: For a candidate set C, let xq^ & {0, l}\ Cd \ denote the 
augmented vector corresponding to the observation Xj, and define Sc := Cov(xc,i, xc,i)- Let 
P := S^" 1 Cov(xc y i, Vi). Then set 

KtmkJ^S, r n >,M 2 J^S, (29) 
V n V n 

where <p is some function such that (xc t i, u) is sub-Gaussian with parameter V 92 !! - "!!! f° r an y 
(2 d — l)-sparse vector u, and ip does not depend on u. We have the following consistency 
result, the analog of Proposition 1 for the augmented set of vectors. Again, the yts denote 
observations at a given node s and Xj's denote observations from all other nodes. It applies 
to Algorithm 4 with the pairs (r,7) and (A n ,r n ) chosen as in equations (28) and (29). 

Corollary 5. Consider i.i.d. observations {xi, yi}f = i generated from an Ising model satisfying 
the correlation decay condition (27), and suppose that 

n fc (*' + ^^{j-^-j, 2-) log |Q|. (30) 

Then there are universal constants (c, c', c") such that with probability at least 1— cexp(— d logp): 
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log(4/«) 

(i) The set C from step (2) of Algorithm 4 has cardinality at most \C\ < d c 



(ii) Algorithm 4 recovers all neighbors t 6 N(s) such that \f3 t \ > c"<£>||/3||2 



logjgdj 
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The proof of Corollary 5 is contained in Appendix D.2. Due to the exponential factor 2 d 
appearing in the lower bound (30) on the sample size, this method is suitable only for bounded- 
degree graphs. However, for reasonable sizes of d, the dimension of the linear regression 



impact on the runtime of the algorithm. We explore two classes of bounded-degree graphs 
with correlation decay in the simulations of Section 4.4, where we generate Erdos-Renyi graphs 
with edge probability c/p and square grid graphs in order to test the behavior of our recovery 
algorithm on non-trees. 

When m > 2, corresponding to non-binary states, we may combine these ideas with the 
overlapping group Lasso [15] to obtain similar algorithms for nodewise recovery of non-tree 
graphs. However, the details are more complicated, and we do not include them here. Note 
that our method for nodewise recovery in non-tree graphical models may again be easily 
adapted to handle noisy and missing data. 

4.4 Simulations 

Figures 3 and 4 depict the results of simulations we performed to test our theoretical predic- 
tions. We ran simulations for the graphical Lasso method and nodewise regression methods 
on both tree-structured and non-tree graphs with data generated from a binary Ising model. 
In addition, we included corruptions due to varying levels of missing data in the plots of Fig- 
ure 3. In the fixed-degree cases, we see that the curves roughly stack up for different problem 
sizes p when the probability of correct recovery is plotted against the rescaled sample size 
j^j^i agreeing with our theory. 

In Figure 3(a), we simulated the graphical Lasso method described in Section 4.1 applied to 
the dinosaur graph of Figure 1(e). We generated data from an Ising model with node weights 
0.1 and edge weights 0.3 (corresponding to {—1, 1} variables). The curves show the probability 
of success in recovering the 15 edges of the graph, as a function of the rescaled sample size 
where p = 13. In addition, we performed simulations for different levels of missing data, 
specified by the parameter p £ {0,0.05,0.1,0.15,0.2}, using the corrected estimator (23). 
Note that all five runs display a transition from success probability to success probability 1 
in roughly the same range, as predicted by our theory. Indeed, since the dinosaur graph has 
only singleton separators, Corollary 2 ensures that the inverse covariance matrix is exactly 
graph-structured, so our global recovery method is consistent at the population level. Further 
note that the curves shift right as the fraction p of missing data increases, since the recovery 
problem becomes incrementally harder. 

Panels (b) and (c) of Figure 3 show the results of the nodewise regression method of Sec- 
tion 4.2 applied to chain and star graphs, with increasing numbers of nodes p £ {32, 64, 128} 
and p £ {64, 128,256}, respectively. For the chain graphs in panel (b), we set node weights 
of the Ising model equal to 0.1 and edge weights equal to 0.3. For the varying-degree star 
graph in panel (c), we set node weights equal to 0.1 and edge weights equal to where the 
degree d of the central hub grows with the size of the graph as [logpj • Again, we show curves 
for different levels of missing data, p G {0,0.1,0.2}. The modified Lasso program (24) was 
optimized using a form of composite gradient descent due to Agarwal et al. [1], guaranteed to 



problem decreases from 0(p d ) to \Cd 




which has a significant 
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converge to a small neighborhood of the optimum even when the problem is non-convex [22]. 
In both the chain and star graphs, the three curves corresponding to different problem sizes 
p at each value of the missing data parameter p stack up when plotted against the rescaled 
sample size. Note that the curves for the star graph stack up nicely with the scaling ^ ^ , 
rather than the worst-case scaling n x (i 3 logp, corroborating the remark following Proposi- 
tion 1. Since d = 2 is fixed for the chain graph, we use the rescaled sample size in our 
plots. 

Finally, Figure 4 shows the results of our nodewise regression method of Section 4.3 applied 
to two classes of non-tree graphs. In panel (a), we generated an Erdos-Renyi graph with edge 
probability |, for three different values of p G {64,128,256}. In panel (b), we used a square 
grid graph with dimensions 8 x 8, 12 x 12, and 16 x 16. We used node weights 0.1 and edge 
weights 0.3 in both cases. The plots show the probability of correct neighborhood recovery 
for a randomly-chosen node in each graph. To save on computation, we employed the neigh- 
borhood screening method described in Section 4.3 to prune the candidate neighborhood set 
before performing the linear regression. Since the number of neighbors d was known a priori, 
we selected a candidate neighborhood set of size |_2.5cZJ with highest empirical correlations, 
then performed a single regression against all singleton nodes and products of subsets of the 
candidate neighborhood set of size at most d, via the modified Lasso program (24). The size 
of the candidate neighborhood set was tuned through repeated runs of the algorithm. We 
see that the success probability increases from to 1 as the sample size increases for the 
curves in both panels, illustrating that our algorithm is consistent for neighborhood recovery. 
Furthermore, the curves for different values of p stack up when plotted against the rescaled 
sample size as predicted by our theory. 



5 Proof of sample-based regression result 

In this section, we provide a proof of our main nodewise recovery result, Proposition 1. 
For proofs of supporting technical lemmas and all corollaries appearing in the text, see the 
Appendix. 

We derive Proposition 1 as a consequence of a more general theorem. Suppose we have 
i.i.d. pairs of observations {(xj, 2/i)}f =1 , with Xi G M p and j/j G M, and we wish to estimate 
the best linear predictor j3 = S^ 1 Cov(x,, j/j), when [3 is /c-sparse. Loh and Wainwright [22] 
formulated a modified version of the Lasso based on possibly corrupted observations; however, 
they assume the linear regression model 

yi = xjl3 + ei, (31) 

where q is sub-Gaussian noise and ei _LL Xj. Note that although the model (31) holds in the 
case where yi is a sample from a single node and Xi is a sample from all other nodes in a 
Gaussian graphical model, the model (31) does not hold in a general discrete graphical model. 
Nonetheless, we show that essentially the same Lasso estimator provides an estimator for (3 
that is consistent for support recovery. Suppose the pair (r,7) in the Lasso program (24) 
satisfies the following deviation bounds: 



||r/?-7ll«,<m/— > ( 32a ) 

n 



(r-S*Hoo<^WooA/^^ V«G Bi(8fc)nBoo(l), (32b) 

n 
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success prob vs. sample size for dino graph with missing data 
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(a) Dino graph with missing data 



(b) Chain graph with missing data 



success prob vs. sample size for star graph, rho = 0, 0.1 , 0.2 
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n/d log p 



(c) Star graph with missing data, d ~ \ogp 



Figure 3. Simulation results for global and nodewise recovery methods on binary Ising models, 
allowing for missing data in the observations. Each point represents an average over 1000 trials. 
Panel (a) shows simulation results for the graphical Lasso method applied to the dinosaur graph. 
Panel (b) shows simulation results for nodewise regression applied to chain graphs for varying 
p and p. Panel (c) shows simulation results for nodewise regression applied to star graphs with 
maximal node degree logp and varying p. The horizontal axis gives the rescaled sample size 

n 

d' 2 log p ' 



for some ipi,<f2- Also suppose T satisfies the lower-restricted eigenvalue (RE) condition: 

v T Fv > a\\v\\2 Vf s.t. \\v ||i < \/A;||i;||2- (33) 
Then we have the following technical result: 

Theorem 2. Suppose the pair (T,j) satisfies the deviation conditions (32a) and (32b), as 
well as the lower- RE condition (33). Also suppose n ^ max j y ^ , (p\ IS" 1 1^ i k logp 

and \ n y (piy f^. Then any optimum (3 of the Lasso program (24) satisfies 

||^-^||oo<4A n IIS- 1 !^. 

The proof of Theorem 2 is provided in the Appendix. In order to prove Proposition 1, we 
first establish that the deviation conditions (32a) and (32b) of Theorem 2 hold w.h.p. with 
(tpi, ip2) = (^ll/^lbi V 9 )) an d the lower-RE condition holds with a = ^A m i n (S x ). 
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success prob vs. sample size for Erdos-Renyi graph 
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(a) Non-tree Erdos-Renyi graph 
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(b) Grid graph 



Figure 4. Simulation results for nodewise recovery methods on binary Ising models for non- 
tree graphs. Each point represents an average over 500 trials. Panel (a) shows simulation 
results for nodewise regression applied to a non-tree Erdos-Renyi graph with edge probability 
3/p. Panel (b) shows simulation results for nodewise regression applied to a square grid with 
varying numbers of nodes. The horizontal axis gives the rescaled sample size j- 2 — ■ 
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II173-7II0C < ||(r -E a .)j0|| oo + || Cov^,^)- 7lU- 
X T X 



(34) 



Cov(xi,yi) - j\ 



n 



< 



X T y 



n 



E( Xi xi ) (3 



+ XX 



+ \\yx-E(y l )E(x i )\\ c 



As in the analysis of inequality (49), we may disregard the two second terms involving empir- 
ical means, since they concentrate at a fast rate. Since xj (3 is sub-Gaussian with parameter 
<£> 2 ||/3||! by assumption, and ejxi and yi are clearly sub-Gaussian with parameter 1, the devi- 
ation condition (32a) follows with = y||/3||2 by standard concentration bounds on an i.i.d. 
average of products of sub-Gaussians (cf. Lemma 14 of Loh and Wainwright [22]). 

For the second deviation bound, we will verify the bound over a more tractable set via 
the following lemma: 



Lemma 5. For any constant cq > 0, we have 

Bi(c A:) nBoo(l) C (l + c )cl{conv{B (fc)n: 



»(!)}}■ 



Hence, it is sufficient to establish the deviation inequality (32b) over the set Bo(A;)nB 00 (l). 
We proceed via a discretization argument. Suppose {v\, . . . ,vm} is a ^-covering of the unit 
•£oo-ball in M fc in its own metric. By standard results on metric entropy, we know that such 
a covering exists with M < c k . Writing ip(v) = \\(T — Tj x )v\\oo, we know that there exists Vj 
such that \\v — Vs\\aa < h- Let Av = v 



Vj. Then 



1 



ip(v) = ||(r - S z )(uj + Aw) II oo < tp(vj) + ip(Av) < sup ip(vj) + - sup ip(v), 



1<3<M 



IM|oo<l 
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simply by rescaling. Taking the sup over {||t»||oo < 1} on the LHS and rearranging then yields 

sup < 2 sup 4>{vj). 

|M|oo<l i<j<M 

Hence, it suffices to establish the bound for a given v S Bi(cofc) nBoo(l), then take a union 
bound over the M < c k elements in the discretization and the (?) < p k choices of the support 
set. 

For a given £>sparse v, note that xjv has sub-Gaussian parameter v^lMli by assumption, 

and 

IMl! < IMI lll^Hoo — vfc||f ||2 Halloo; 

so xjv is sub-Gaussian with parameter y^/cHuH^. Since ejxi is sub-Gaussian with parameter 
1, it follows from the same recentering techniques as in inequality (49) that 

||(f - T, x )v\\oo = max \ef(f - H x )v\ < t 
with probability at least 1 — c\ exp ^^rpp-^ . Taking a union bound over the discretization 



and setting t = c<p\/k\\v\\ 00 y fcl ° gp then implies the deviation bound (32b) with ipi = <p, under 
the scaling n ^ ip 2 k 2 log p. 

The lower-RE condition (33) may be verified analogously to the results in Loh and Wain- 
wright [22]. The only difference is to use the fact that xjv is sub-Gaussian with parameter 
9? 2 ||v||! in all the deviation bounds. Then the lower-RE condition holds with probability at 
least 1 — ci exp(— 02k logp), under the scaling n £3 (p 2 k\ogp. 

We may take A n x ^ll/^lbv/^ 2 in Theorem 2 to conclude that w.h.p., 



n 



Finally, note that the vector (3 is exactly equal to column s of the inverse covariance matrix 
r, according to the following lemma (a straightforward application of block matrix inversion): 

Lemma 6. Let (Yi, . . . ,Y m ) be a zero-mean vector with covariance matrix S, and let S C 
{1, ... , m}. The best linear predictor of Yg upon Y\g, defined as Er^vgE^g, is a matrix 
multiple of (S" 1 )^^. 

Hence, combining Corollary 1 and Theorem 2 implies that thresholding succeeds w.h.p. 
for neighborhood recovery in a tree graph. 



6 Discussion 



The correspondence between the inverse covariance matrix and graph structure of a Gauss- 
Markov random field is a classical fact, one with many useful consequences for estimation 
of Gaussian graphical models. It has been a long-standing open question as to whether or 
not similar properties extend to a broader class of graphical models. In this paper, we have 
provided a partial affirmative answer to this question, and moreover have developed theoretical 
results extending such relationships to discrete undirected graphical models. 

As shown by our results, the inverse of the ordinary covariance matrix is graph-structured 
for special subclasses of graphs with singleton separator sets. More generally, we have shown 
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that it is worthwhile to consider the inverses of generalized covariance matrices, formed by 
introducing indicator functions for larger subsets of variables. When these subsets are cho- 
sen to reflect the structure of an underlying junction tree, the edge structure is reflected 
in the inverse covariance matrix. Our population-level results have a number of statistical 
consequences for graphical model selection. We have shown how our results may be used to 
establish consistency (or inconsistency) of certain standard methods for discrete graph selec- 
tion, and have proposed new methods for neighborhood recovery that may be applied even 
when observations are systematically corrupted by mechanisms such as additive noise and 
missing data. Furthermore, our methods are attractive in their simplicity, in that they only 
involve solving basic programs such as linear regression. 
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A Proofs of supporting lemmas for Theorem 1 

In this section, we supply the proofs of more technical results that underlie the proof of 
Theorem 1, including Lemmas 1, 3 and 4. 

A.l Proof of Lemma 1 

The regularity of the family (7) is immediate, since the log normalization constant 

A(0) :=log ex P{ J>°' W*c))} (35) 
xexp cec 

is finite for any 9 £ R D . 

To establish minimality, suppose Ylc J a C;J^-C;j( x c) = b almost surely, where the coeffi- 
cients ac-j are real-valued and b is some constant. Plugging in x such that x s = for all 
s E V and using the fact that all states have positive probability, we see that 6 = 0. Now 
assume that not all ac-j are equal to 0. Let (C, J') be some pair such that ac-,j> ^ and 
\C'\ is minimal. Plugging in x such that xqi = J' and X\ni = 0, we have 

^ac ;J lc;j(xc) = ac>;j>, 
C,J 

by the minimality of \C'\. This contradicts the fact ac";J' 7^ 0- Hence, we conclude that the 
indicator variables are indeed linearly independent, implying that the family (7) is minimal. 

A. 2 Proof of Lemma 3 

By Proposition B.2 of Wainwright and Jordan [31] (cf. Theorems 23.5 and 26.3 of Rockafel- 
lar [28]), we know that the dual function is differentiable on the interior of the marginal 
polytope M. (see equation (11)) with 

V<F» = (V^V) (36) 
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for [i G int(Al). Also, by Theorem 3.4 of Wainwright and Jordan [31], for any fj, G int(TW), 
the negative dual function takes the form <&*(//) = —H(qg(fj,)), where 6>(/i) = (V<J>) -1 (/x). 

From Lemma 2, the log partition function has derivatives of all orders, so using the 
relation (36) and the implicit function theorem, we see that <f>* is also twice-differentiable on 
int(A4). Moreover, the relation (36) implies 

(V$)(V$*(//)) = fi for all \i G M. 

Since this equation holds on an open set, we may take derivatives; employing the chain rule 
yields 

(V 2 $)(V<T(/i)) • (V 2 <T(/i)) = I DxD . 
Rearranging yields the relation V 2( 3?*(/i) = (V 2 $(0))~ 1 \ g=e ^, as claimed. 

A. 3 Proof of Lemma 4 

We prove this proposition by induction on the subset size. For sets of size 1, the claim 

is immediate, as illustrated in the explicit construction (17). Suppose the claim holds for 

all subsets up to some size k > 1, and consider a subset of size k + 1, which we write as 

lei 

C = {l,...,/c + l}, without loss of generality. For any configuration J G Xq , the marginal 
probability qc(xc = J) is equal to pc-,j, by construction. Consequently, we need only specify 

how to determine the probabilities qc{xc = J) for configurations J G X\ c \\Xq . By the 
lei 

definition of Xq , each j G J has j s = for at least one s G {1, . . . , k + 1}. 

We show how to express the remaining marginal probabilities sequentially, inducting on 
the number of positions s for which j s = 0. Starting with the base case in which there is a 
single zero, suppose without loss of generality that jk+i = 0. For each t G {1, 2, . . . , m — 1}, 
let J e be the configuration such that J- = Ji for all i ^ k + 1 and = i- Defining 

D := C\{k + 1}, we then have 

m— 1 

qc{xc = J) = qD { x d = J') - ^2 ^c( x c = j£ )> ( 37 ) 

where J' G X k is the configuration defined by J[ = Ji for all i = 1,2, ... ,k. Since \D\ = 
k, our induction hypothesis implies that qn{xD — is a linear function of the specified 
mean parameters. Moreover, our starting assumption implies that G X^ for all t = 
{1, 2, . . . , m — 1}, so we have qc{%c = J ) = Pc-J 1 - This establishes the base case. 

Now suppose the sub-claim holds for all configurations with at most t nonzeros, for some 
t > 1. Consider a configuration J with t + 1 zero entries. Again without loss of generality, we 
may assume jk+i = 0, so equation (37) may be derived as before. This time, the configurations 

are not in X^ (since they still have t > 1 zero entries); however, our induction hypothesis 
implies that the corresponding probabilities may be written as functions of the given mean 
parameters. This completes the inductive proof of the inner claim, thereby completing the 
outer induction, as well. 

B Proofs of population-level corollaries 

In this Appendix, we prove Corollaries 1 and 3. (As previously noted, Corollary 2 is an 
immediate consequence of Corollary 1.) 
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B.l Proof of Corollary 1 

Recall that C denotes the set of all cliques in the triangulation G. The covariance matrix in 
Theorem 1 is indexed by C, and our goal is to define appropriate blocks of the matrix and 
then apply the matrix inversion lemma [14]. Consider the collection pow(5). We define the 
collection of singleton subsets V = {{1}, {2}, . . . , {p}}, and introduce the disjoint partition 

C = (pow(S) U V) U (C\{pow(S) U V}j . 

y v ' V v ' 

u w 
The following property of the collection W is important: 

Lemma 7. For each maximal clique C G C, define the set collection J~(C) = pow(C)\U . For 
any A G W, we have A G F(C) for exactly one C. 

Proof. We first establish existence. Since W C C, any set A G W is contained in some 
maximal clique Ca- Since A ^ U, we clearly have A G F{Ca)- 

To establish uniqueness, consider a set A belonging to the intersection C\ n C2 of two 
maximal cliques. If these cliques are adjacent in the junction tree, then A belongs to the 
separator set C\ n C2, so A cannot belong to W, by definition. Even when C\ and C2 are 
not adjacent, the running intersection property of the junction tree implies that C\ n C2 must 
belong to every separator set on the unique path between C\ and C2 in the junction tree, 
implying that A ^ W, as before. This is a contradiction, implying that the maximal clique 
Ca is unique. □ 

Define F = (cov(*(X;C))) . By the block-matrix inversion formula [14], we may write 

e := (cov (^{x-u)))^ 1 = f{um) - r(w, w)(r(w, w)) _1 r(w,^). (38) 

We need to show that 0(A, B) = for any members A,B €U that do not belong to the same 
maximal clique. By Theorem 1(a), we have F(A,B) = whenever A and B do not belong to 
the same maximal clique, so it remains to show that F(A, W) (r(W, W)) F(W, B) = 0. 

We begin by observing that the matrix r(W, W) is block-diagonal with respect to the 
partition {F{C) : C G C} previously defined in Lemma 7. (Indeed, consider two sets D, E G W 
with D G F{C) and G JF(C') for distinct maximal cliques C ^ C . Two such sets cannot 
belong to the same maximal clique, so Theorem 1(a) implies that F(D, E) = 0.) Since block- 
diagonal structure is preserved by matrix inversion, the inverse T = (F(W, W))" 1 shares this 
property, so for any two members A,B £U, we may write 

F(A,W)(F(W,W)y 1 F(W,B)= £ F(A,T(C))T(T(C),T(C))F(T(C),B). (39) 

T(C),cec 

We claim that each of these terms vanishes. For a given maximal clique C, suppose A is 
not contained within C"; we first claim that F(A,J-(C')) = 0, or equivalently, for any set 
D G JF(C), we have F(A,D) = 0. From Theorem 1(a), it suffices to show that A and 
D cannot be contained within the same maximal clique. From Lemma 7, we know that A 
belongs to a unique maximal clique C. Any set D G J-{C) is contained within C; if it were 
also contained within C, then D would be contained in C D C . But as argued in the proof 
of Lemma 7, this implies that D is contained within some separator set, whence it cannot 
belong to F(C). We thus conclude that r(^4, D) = 0, as claimed. 

Taking any two subsets A and B that are not contained in the same maximal clique, we 
see that for any clique C, we must either have F(A, T(C)) = or r(J r (C),i?) = 0. Hence, 
each term in the sum (39) indeed vanishes, completing the proof. 
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B. 2 Proof of Corollary 3 

This corollary follows by a similar argument as in the proof of Corollary 1. As before, let C 
denote the set of all cliques in the triangulation G, and let V = {{1}, {2}, . . . , {p}}. Define 
U = pow(S(s; d)) U V and W = C\U. 

Let C s := s U N(s), and define a disjoint partition of W by taking T\ := pow(C s )\V( and 
J-2 '■= W\J-\. Note that C s is the unique maximal clique in C containing s. By construction, 
every clique in Ti does not contain s and has more than d elements, whereas every clique in 
J-\ is contained in C s , with \C S \ < d + 1. It follows that no two cliques A € T\ and B £ Ti 
can be contained in the same maximal clique. Denoting T := (cov(^/(X;C)))~ 1 , we conclude 
via Theorem 1(a) that r(W,yV) is block-diagonal. 

We now use the block matrix-equation formula (38). As before, Theorem 1(a) implies that 
T(U,U) is graph-structured according to G. In particular, for any B €U with B C C s , we have 
r({s},i?) = 0. (The elements of IA that are subsets of C s are exactly {s} and the nonempty 
subsets of N(s).) Hence, it remains to show that r({s}, W)(r(W, W))~ 1 r(yV, B) = 0. 

Analogous to equation (39), we may write 

2 

r^wxr^w))- 1 ^,^) = ^2v({ s },T l )T(T i ,T i )r(T ll B), 

i=i 

where T := (r(W, W))" 1 . Applying Theorem 1(a) once more, we see that r(J 7 i,i?) = 0, 
since B C C s and T({s},J-2) = 0. Hence, the matrix = (cov(^(X;U)))~ 1 appearing in 
equation (38) is indeed s-block graph-structured. 

C Proof of Theorem 2 

We begin by outlining the main argument of the proof, with proofs of supporting lemmas in 
the following sections. 

C. l Main argument 

We begin by establishing l\- and £2- error bounds, which will be used in the sequel: 

Lemma 8. Suppose the deviation condition (32a) holds and T satisfies the lower-RE condi- 
tion (33). Also suppose A n £3 V?i \j • Then any global optimum /3 of the Lasso program (24) 
satisfies the bounds 

W-Ph< ^max{^u/^,A„}, (40) 
ci£ V n 



||£-^|i<^maxW^,A„}. (41) 
ct£ V n 

We now argue that for suitable scaling n £3 /elogp, any optimum f3 lies in the interior of 
Bi(6oVfc): 

Lemma 9. Suppose /3 is an optimum of the Lasso program (24). Then under the scaling 
2 

n 



a(bo-\\Ph) 



~ ( 771 ^lall N ) ^ 1°§ Pi We nave 
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By Lemma 9, we are guaranteed that (3 is an interior point of the feasible set. Consequently, 
by Proposition 2.3.2 of Clarke [10], we are guaranteed that is a generalized gradient of 
the objective function at /3. By Proposition 2.3.3 of Clarke [10], there must exist a vector 
z £ <9||/3||i L_s such that 



f^-7 + A n z = 0. 

Denoting the loss function C{(3) = \(3 T Tj3 — j T (3, we have V£(/3) = F/3 — 7, so 
VC0) - VC0) = VC0) + \ n z = f p - 7 + \ n z. 

Then 

\\VC0) - VC0)\\oo < \\TP - 7IU + A n p|| 00 < ||f/3 - 7IU + A n . (42) 
Using the deviation bound (32a) again, we have 



||r73-7lU<^' 



It follows from equation (42) that if A n > </?iy ^p, then 

||V£(^)-V£(^)||oc<2A n . (43) 

Finally, we lower-bound 

||V£(/3)- V£(3)||oc = llf^lloo 

> HSxPlloo - ||(r - s^^Hoo 

> lls^ll^iiPiU-iKr-s^iu. (44) 

Now note that ||P||i < 8\/^||?||2i as shown in Loh and Wainwright [22], so we have 

IN2 < ll^ll 

and dividing through by ||P||2 gives ||z?||2 < In particular, < 8/cH^Hoo. Apply- 

ing inequality (32b) to v = mw~ then gives 



||(r - EaO^lloc < C(f2\\v\ 

Combining this with inequality (44), we have 



k logp 



ri 



|rz?||oo > ||z?||oo ( IN ^-1 III Cp2y lC ^ P j , (45) 



so when n ^ y\ HS^, 1 !^ /clogp, we have 

Tu L, > 



2 P£ 

III - 1 moo 

Finally, combining with inequality (43) yields the result of the theorem. 
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C.2 Proof of Lemma 8 



The proof is essentially the same as in the case of a standard linear model [22]. From the 
fact that (3 is feasible and /3 is optimal, we obtain a basic inequality. Furthermore, defining 
v = (3 — (3, we may verify the cone condition < cv / ^||£'||2- We will not repeat the 

arguments here. 



C.3 Proof of Lemma 9 

Note that 

|^-/3||i>|^||i-||/3||i>||^||i-v^||/3|| 2 . 
Hence, if G dM^boVk), we have 

-Ph> bo^k - \\p\\ 2 Vk = (b - M\2)Vk. (46) 

On the other hand, Theorem 1 in Loh and Wainwright [22] guarantees that under deviation 
condition (32a) and the lower- RE condition (33), we have the ^i-bound 



< «M /k8P (47) 
a V n 



Combining inequalities (46) and (47) gives 



a V n 

contradicting the assumption that n > ( Cip },~,, , \ klogp. 

\a(bo-\\0\\ 2 ) / 

C.4 Proof of Lemma 5 

We denote the left-hand set by A and the right-hand set by B. It suffices to show that 
<Pa(z) < <Pb{z) for all z, where <p is the support function. 

For a given z, let S be the set of indices of coordinates of z with highest absolute value. 
We may write 

<Pa{z) = sup(0, z) 

= sup(6» 5 , zs) + (0 S c, z S c) 
eeA 

< \\zs\\i + co k 1 1 zsc | |oo, (48) 
(Os, Zs) < \\0s\\oo\\zs\\l < ||f Hoolkslll < \\zs\\l 



since 



and 

(9 S c, Z S c) < \\0s°\\l\\ Z S°\\°o < Co^Hzsclloo 

for 6 G A. Furthermore, fc||^s c ||oo < II 111- Hence, inequality (48) becomes 

<Pa(z) < (1 + co)lks||l- 

Finally, note that 

<Pb(z) = (1 + c ) max sup (9u, z v ) = (1 + c )||zs||i, 
|t/|<fc n^iu^i 

establishing the desired result. 
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D Proofs of sample-based corollaries 



Here, we provide proofs for the remaining corollaries involved in sample-based approaches to 
graph selection. 



D.l Proof of Corollary 4 

As noted by Liu et al. [20], the proof of this corollary hinges only on the deviation condi- 
tion condition (20) being satisfied w.h.p.; the rest of the proof follows from the analysis of 
Ravikumar et al. [26]. Hence, we verify inequality (20) with </?(£*) = c\ and ip(n,p) = c'logp. 
Note that 



IS - El 



T — T 

rt* . ry* 'Y' 'Y' 

•Aj 2 "J ii •Aj •Aj 



< 



\ n ^ 

\ i=l 

1 n 

/ , E(XjXj ) 



i=l 



+ \\xx T -¥,{xi)¥,{xif 



(49) 



where we have used the triangle inequality and the fact that X = E(xixf) — E(xj)E(xj) T 
in the second line. Noting that ||y|| m ax — maxj^ \ejYek\ for a matrix Y, and the random 
variables ejxi are i.i.d. Bernoulli (sub-Gaussian parameter 1) for each fixed j, we conclude 
by standard sub-Gaussian tail bounds (cf. Lemma 14 in Loh and Wainwright [22]) that the 

first term is bounded by \f^, with probability at least 1 — cexp(— d log p). For the second 
term, we may further bound 

\\xx T - E(xi)E( Xi ) T \\ max <\\(x- E( Xi ))(x - E(xi)) T \\ max + 2||E(x i )|| 0O ||x - E(xi)||oo, 

by way of the triangle inequality. Note that ej(x — E(xj)) is an average of i.i.d. sub-Gaussian 
variables with parameter 1, hence has sub-Gaussian parameter ^. Therefore, we have the 

even tighter bound ^^/^rP ^ or ^ n ^ s term. Combining the bounds for the two terms in in- 
equality (49) establishes the deviation condition (20). 

By the machinery of Ravikumar et al. [26], we then have the elementwise bound 

P[||G - 9*|| max > T n ) < cexp(-c'logp). 
The statement about thresholding to obtain a consistent estimate of G* follows immediately. 



D.2 Proof of Corollary 5 

The analysis borrows techniques from the paper [6]. We first prove that under the scaling 
n y n 2 logp, we have \rc(s,t) — rc(s,t)\ < j for all (s,t) G V x V, with probability at least 
1 — ci exp(— C2 logp). First fix a pair (s,t) and a corresponding pair of values (x s ,Xt)- By a 
simple application of Hoeffding's inequality, we have 

P (\F(X a = x s ,X t = x t )-P(X s = x s ,X t = x t )\ > e) < cexp(- C / ne 2 ), 
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and similarly for the deviations \F(X S = x s ) - F(X S = x s )\ and |P(X t = x t ) - F(X t = x t )\. 
Note that 

\r C (s,t)-r C (s,t)\ < (\F(X S = x s ,X t = x t ) -F(X S = x s ,X t = x t )\ 

+ \F(X S = x s )F(X t = x t ) - F(X S = x s )F(X t = x t )\ 

Furthermore, 

\F(X S = x s )F(X t = x t ) - F(X S = x s )F(X t = x t )\ < \F(X S = x s ) - F(X S = x s )\- F(X t = x t ) 

+ \F(X t = x t ) - F(X t -x t )\- F(X S = x s ) 
<2e, 

so taking a union bound over all pairs (s, t) and all values (x s ,xt), we have \rc(s,t) — rc(s,t)\ < 
3m 2 e for all (s,t) £ V x V, with probability at least 1 — cm 2 p 2 exp(— c'ne 2 ). Finally, taking 
e = 12m 1 anc ^ usm § the fact that n £3 n 2 logp gives the desired bound, with probability at 
least 1 — ci exp(— C2 logp). 

In particular, it follows that 

N(s)CCC {teV :r c (s,t)>^}, 

with probability at least 1 — c\ exp(— C2 logp). Since the last subset has cardinality at most 

log(4/«) log(4/«Q 

d c by the correlation decay condition, we also have \C\ < d c , as claimed. 

The remainder of the proof is identical to the proof of Proposition 1, and is a consequence 
of Theorem 2. 
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